
136 research outputs found

SSW Library: An SIMD Smith-Waterman C/C++ Library for Use in Genomic Applications

Author: Garrison Erik
Lee Wan-Ping
Marth Gabor T.
Zhao Mengyao
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2013

Summary: The Smith-Waterman (SW) algorithm, which produces the optimal pairwise alignment between two sequences, is frequently used as a key component of fast heuristic read mapping and variation detection tools, but current implementations are either designed as monolithic protein database searching tools or are embedded into other tools. To facilitate easy integration of the fast Single Instruction Multiple Data (SIMD) SW algorithm into third-party software, we wrote a C/C++ library, which extends Farrar's Striped SW (SSW) to return alignment information in addition to the optimal SW score. Availability: SSW is available both as a C/C++ software library and as a stand-alone alignment tool wrapping the library's functionality at https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library. Contact: [email protected]. Comment: 3 pages, 2 figures

arXiv.org e-Print Archive

CiteSeerX

Public Library of Science (PLOS)
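
To make the summary above concrete, here is a minimal sketch of the scalar Smith-Waterman recurrence in Python. It is not the SSW library's API and does not use Farrar's striped SIMD layout; the function name, the linear gap penalty, and the scoring values (match=2, mismatch=-1, gap=-2) are assumptions chosen for illustration only.

```python
def smith_waterman(query, target, match=2, mismatch=-1, gap=-2):
    """Return (best_score, query_end, target_end) for the best local alignment.

    Plain O(len(query) * len(target)) dynamic programming with a linear gap
    penalty. The scoring values are illustrative assumptions, not SSW defaults.
    End positions are 1-based; (0, 0, 0) means no positive-scoring alignment.
    """
    best = (0, 0, 0)
    prev = [0] * (len(target) + 1)  # previous row of the score matrix
    for i, q in enumerate(query, start=1):
        curr = [0] * (len(target) + 1)
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr[j] = score
            if score > best[0]:
                best = (score, i, j)
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman("TTACGTTT", "GGACGGG"))
```

The sketch returns only the score and the end coordinates of the best local alignment. Recovering the full alignment would additionally require a traceback over the score matrix, which is omitted here.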

Graphical pangenomics

Author: Garrison Erik
Publication venue: Biological Sciences
Publication date: 15/10/2018

Completely sequencing genomes is expensive, and to save costs we often analyze new genomic data in the context of a reference genome. This approach distorts our image of the inferred genome, an effect which we describe as reference bias. To mitigate reference bias, I repurpose graphical models previously used in genome assembly and alignment to serve as a reference system in resequencing. To do so I formalize the concept of a variation graph to link genomes to a graphical model of their mutual alignment that is capable of representing any kind of genomic variation, both small and large. As this model combines both sequence and variation information in one structure it serves as a natural basis for resequencing. By indexing the topology, sequence space, and haplotype space of these graphs and developing generalizations of sequence alignment suitable to them, I am able to use them as reference systems in the analysis of a wide array of genomic systems, from large vertebrate genomes to microbial pangenomes. To demonstrate the utility of this approach, I use my implementation to solve resequencing and alignment problems in the context of Homo sapiens and Saccharomyces cerevisiae. I use graph visualization techniques to explore variation graphs built from a variety of sources, including diverged human haplotypes, a gut microbiome, and a freshwater viral metagenome. I find that variation-aware read alignment can eliminate reference bias at known variants, and this is of particular importance in the analysis of ancient DNA, where existing approaches result in significant bias towards the reference genome and concomitant distortion of population genetics results. I validate that the variation graph model can be applied to align RNA sequencing data to a splicing graph. Finally, I show that a classical pangenomic inference problem in microbiology can be solved using a resequencing approach based on variation graphs. Funding: Wellcome Trust PhD fellowship

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Apollo (Cambridge)

Haplotype-aware graph indexes

Author: Durbin Richard
Garrison Erik
Novak Adam M.
Paten Benedict J.
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 18th International Workshop on Algorithms in Bioinformatics (WABI 2018)
Publication date: 01/01/2018

The variation graph toolkit (VG) represents genetic variation as a graph. Each path in the graph is a potential haplotype, though most paths are unlikely recombinations of true haplotypes. We augment the VG model with haplotype information to identify which paths are more likely to be correct. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows-Wheeler transform. We demonstrate the scalability of the new implementation by indexing the 1000 Genomes Project haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server
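
The abstract's observation that every path is a potential haplotype, but most paths are unlikely recombinations, can be illustrated with a toy example. The sketch below is not the vg data model or the GBWT index; the node sequences, edges, and the set of "observed" haplotypes are hypothetical data invented for illustration.

```python
# Toy variation graph with two consecutive SNP "bubbles" (hypothetical data).
nodes = {1: "AC", 2: "G", 3: "T", 4: "CA", 5: "A", 6: "C", 7: "TT"}
edges = {1: [2, 3], 2: [4], 3: [4], 4: [5, 6], 5: [7], 6: [7], 7: []}


def all_paths(node, sink):
    """Yield every node path from `node` to `sink` (the graph is a DAG)."""
    if node == sink:
        yield [node]
        return
    for nxt in edges[node]:
        for rest in all_paths(nxt, sink):
            yield [node] + rest


def spell(path):
    """Concatenate node sequences along a path."""
    return "".join(nodes[n] for n in path)


# Haplotypes assumed to have been seen in real samples (hypothetical).
observed = {(1, 2, 4, 5, 7), (1, 3, 4, 6, 7)}

paths = [tuple(p) for p in all_paths(1, 7)]
unobserved = [p for p in paths if p not in observed]

if __name__ == "__main__":
    for p in paths:
        tag = "observed" if p in observed else "recombinant"
        print(spell(p), tag)
```

With two bubbles the graph already contains four paths but only two observed haplotypes. The number of paths doubles with each additional biallelic site, which is why a haplotype index is useful for telling plausible paths apart from the rest.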

Haplotype-aware graph indexes.

Author: Durbin Richard
Garrison Erik
Novak Adam M
Paten Benedict
Sirén Jouni
Publication venue: Bioinformatics
Publication date: 15/01/2020

MOTIVATION: The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are non-biological, unlikely recombinations of true haplotypes. RESULTS: We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows-Wheeler transform. We demonstrate the scalability of the new implementation by building a whole-genome index of the 5008 haplotypes of the 1000 Genomes Project, and an index of all 108 070 Trans-Omics for Precision Medicine Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes. AVAILABILITY AND IMPLEMENTATION: Our software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt and https://github.com/jltsiren/gcsa2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online

Apollo (Cambridge)

A profile in FIRE: resolving the radial distributions of satellite galaxies in the Local Group with simulations

Author: Bailin Jeremy
Benincasa Samantha
Boylan-Kolchin Michael
Bullock James S.
El-Badry Kareem
Faucher-Giguere Claude-Andre
Garrison-Kimmel Shea
Hopkins Philip F.
Loebman Sarah
Samuel Jenna
Tollerud Erik
Wetzel Andrew
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2020

While many tensions between Local Group (LG) satellite galaxies and LCDM cosmology have been alleviated through recent cosmological simulations, the spatial distribution of satellites remains an important test of physical models and physical versus numerical disruption in simulations. Using the FIRE-2 cosmological zoom-in baryonic simulations, we examine the radial distributions of satellites with Mstar > 10^5 Msun around 8 isolated Milky Way- (MW) mass host galaxies and 4 hosts in LG-like pairs. We demonstrate that these simulations resolve the survival and physical destruction of satellites with Mstar >~ 10^5 Msun. The simulations broadly agree with LG observations, spanning the radial profiles around the MW and M31. This agreement does not depend strongly on satellite mass, even at distances <~ 100 kpc. Host-to-host variation dominates the scatter in satellite counts within 300 kpc of the hosts, while time variation dominates scatter within 50 kpc. More massive host galaxies within our sample have fewer satellites at small distances, likely because of enhanced tidal destruction of satellites via the baryonic disks of host galaxies. Furthermore, we quantify and provide fits to the tidal depletion of subhalos in baryonic relative to dark matter-only simulations as a function of distance. Our simulated profiles imply observational incompleteness in the LG even at Mstar >~ 10^5 Msun: we predict 2-10 such satellites to be discovered around the MW and possibly 6-9 around M31. To provide cosmological context, we compare our results with the radial profiles of satellites around MW analogs in the SAGA survey, finding that our simulations are broadly consistent with most SAGA systems. Comment: 18 pages, 10 figures, plus appendices. Main results in figures 2, 3, and 4. Accepted version

arXiv.org e-Print Archive

eScholarship - University of California

Caltech Authors

Genomic diversity and novel genome-wide association with fruit morphology in Capsicum, from 746k polymorphic sites

Author: Albrechtsen Anders
Cardi Teodoro
Colonna Vincenza
D’Agostino Nunzio
Facchiano Angelo
Garrison Erik
Meisner Jonas
Tripodi Pasquale
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2019

Capsicum is one of the major vegetable crops grown worldwide. Current subdivision in clades and species is based on morphological traits and coarse sets of genetic markers. Broad variability of fruits has been driven by breeding programs and has been mainly studied by linkage analysis. We discovered 746k variable sites by sequencing 1.8% of the genome in a collection of 373 accessions belonging to 11 Capsicum species from 51 countries. We describe genomic variation at the population level, confirm major subdivision in clades and species, and show that the known major subdivision of C. annuum separates large and bulky fruits from small ones. In C. annuum, we identify four novel loci associated with phenotypes determining the fruit shape, including a non-synonymous mutation in the gene Longifolia 1-like (CA03g16080). Our collection covers all the economically important species of Capsicum widely used in breeding programs and represents the widest and largest study so far in terms of the number of species and number of genetic variants analyzed. We identified a large set of markers that can be used for population genetic studies and genetic association analyses. Our results provide a comprehensive and precise perspective on genomic variability in Capsicum at the population level and suggest that future fine genetic association studies will yield useful results for breeding.

Archivio della ricerca - Università degli studi di Napoli Federico II

ZENODO

Copenhagen University Research Information System

The distribution and mutagenesis of short coding INDELs from 1,128 whole exomes

Author: Antunes Lilian
Banks Eric
Challis Danny
Evani Uday S
Garrison Erik
Gibbs Richard A
Marth Gabor
Muzny Donna
Poplin Ryan
Yu Fuli
Publication venue: Digital Commons@Becker
Publication date: 01/01/2015

BACKGROUND: Identifying insertion/deletion polymorphisms (INDELs) with high confidence has been intrinsically challenging in short-read sequencing data. Here we report our approach for improving INDEL calling accuracy by using a machine learning algorithm to combine call sets generated with three independent methods, and by leveraging the strengths of each individual pipeline. Utilizing this approach, we generated a consensus exome INDEL call set from a large dataset generated by the 1000 Genomes Project (1000G), maximizing both the sensitivity and the specificity of the calls. RESULTS: This consensus exome INDEL call set features 7,210 INDELs, from 1,128 individuals across 13 populations included in the 1000 Genomes Phase 1 dataset, with a false discovery rate (FDR) of about 7.0%. CONCLUSIONS: In our study we further characterize the patterns and distributions of these exonic INDELs with respect to density, allele length, and site frequency spectrum, as well as the potential mutagenic mechanisms of coding INDELs in humans. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12864-015-1333-7) contains supplementary material, which is available to authorized users

Springer - Publisher Connector

Digital Commons@Becker

PubMed Central

Removing reference bias and improving indel calling in ancient DNA data analysis by mapping to a sequence variation graph

Author: Durbin Richard
Garrison Erik
Jones Eppie R.
Manica Andrea
Martiniano Rui
Publication venue: Genome Biology
Publication date: 17/09/2020

Abstract: Background: During the last decade, the analysis of ancient DNA (aDNA) sequence has become a powerful tool for the study of past human populations. However, the degraded nature of aDNA means that aDNA molecules are short and frequently mutated by post-mortem chemical modifications. These features decrease read mapping accuracy and increase reference bias, in which reads containing non-reference alleles are less likely to be mapped than those containing reference alleles. Alternative approaches have been developed to replace the linear reference with a variation graph which includes known alternative variants at each genetic locus. Here, we evaluate the use of variation graph software vg to avoid reference bias for aDNA and compare with existing methods. Results: We use vg to align simulated and real aDNA samples to a variation graph containing 1000 Genomes Project variants and compare with the same data aligned with bwa to the human linear reference genome. Using vg leads to a balanced allelic representation at polymorphic sites, effectively removing reference bias, and more sensitive variant detection in comparison with bwa, especially for insertions and deletions (indels). Alternative approaches that use relaxed bwa parameter settings or filter bwa alignments can also reduce bias but can have lower sensitivity than vg, particularly for indels. Conclusions: Our findings demonstrate that aligning aDNA sequences to variation graphs effectively mitigates the impact of reference bias when analyzing aDNA, while retaining mapping sensitivity and allowing detection of variation, in particular indel variation, that was previously missed.

Apollo (Cambridge)

Viral coinfection analysis using a MinHash toolkit

Author: Boland Joseph
Castle Phillip E.
Chanock Stephen
Dawson Eric T.
Durbin Richard
Garrison Erik
Lorey Thomas
Mirabello Lisa
Raine-Bennett Tina
Roberson David
Schiffman Mark
Wagner Sarah
Yeager Meredith
Publication venue: BMC Bioinformatics
Publication date: 12/07/2019

Abstract: Background: Human papillomavirus (HPV) is a common sexually transmitted infection associated with cervical cancer that frequently occurs as a coinfection of types and subtypes. Highly similar sublineages that show over 100-fold differences in cancer risk are not distinguishable in coinfections with current typing methods. Results: We describe an efficient set of computational tools, rkmh, for analyzing complex mixed infections of related viruses based on sequence data. rkmh makes extensive use of MinHash similarity measures, and includes utilities for removing host DNA and classifying reads by type, lineage, and sublineage. We show that rkmh is capable of assigning reads to their HPV type as well as HPV16 lineage and sublineages. Conclusions: Accurate read classification enables estimates of percent composition when there are multiple infecting lineages or sublineages. While we demonstrate rkmh for HPV with multiple sequencing technologies, it is also applicable to other mixtures of related sequences

Apollo (Cambridge)
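
As a rough illustration of the MinHash similarity measure mentioned in the abstract above, the following sketch estimates the Jaccard similarity of two sequences from bottom-k sketches of their k-mer hashes. It is not rkmh's implementation; the function names, the k-mer length (k=4), the sketch size (16), and the choice of BLAKE2b as the hash function are all assumptions made for the example.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer (BLAKE2b is an arbitrary choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=4, size=16):
    """Bottom-`size` MinHash sketch: the smallest distinct k-mer hashes."""
    return sorted(hash64(km) for km in kmers(seq, k))[:size]


def jaccard_estimate(a, b, size=16):
    """Estimate Jaccard similarity of two k-mer sets from their sketches."""
    merged = sorted(set(a) | set(b))[:size]
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)


if __name__ == "__main__":
    ref = "ACGTTGCAACGTAGCTAGCTAGGATCCA"
    read = "ACGTTGCAACGTAGCTAGCTAGGATCGA"
    print(jaccard_estimate(sketch(ref), sketch(read)))
```

A read classifier in this spirit could compare a read's sketch against one sketch per reference lineage and assign the read to the closest reference. That design is only one possibility and is not taken from the paper.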